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Abstract 



Maximum marginal likelihood estimation of multidimensional item response theory (IRT) 
models has been hampered by the calculation of the multidimensional integral over the ability 
distribution. However, the researcher often has a specific hypothesis about the conditional 
(in)dependence relations among the latent variables. Exploiting these relations may result in 
more efficient estimation algorithms. A well-known example is the bi-factor model, in which 
each item measures a general dimension and one of K other dimensions, for which Gibbons and 
Hedeker (1992) showed that full information maximum likelihood estimation only requires the 
integration over two-dimensional integrals. In this paper, it is shown how the approach of 
Gibbons and Hedeker (1992) can be placed into a graphical model framework. The advantage of 
the graphical model framework is that efficient estimation schemes can be derived in a fully 
automatic way by applying algorithms to the graphical representation of a statistical model. This 
renders the approach fairly generally applicable, and tedious derivations by hand are no longer 
involved. The generality of the approach is demonstrated by applying it to a multidimensional 
IRT model with a second order dimension. It turns out that full information maximum likelihood 
estimation for such a model also requires the evaluation of two-dimensional integrals only. 

Key words: Item response theory, MML estimation, graphical model framework, estimation 
schemes 
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Maximum marginal likelihood (MML) estimation of multidimensional item response 
theory (IRT) models has been hampered by the fact that the computations involve a numerical 
integration over the latent ability distribution. With increasing dimensionality, brute force 
integration becomes computationally demanding to a degree that the applicability of higher 
dimensional IRT models is jeopardized in practical settings. More specifically, when the integral 
over the ability distribution is evaluated using Gaussian quadrature (e.g., Bock & Aitkin, 1981), 
the number of calculations involved increases exponentially with the number of dimensions. Even 
though the number of quadrature points per dimension can be reduced when using adaptive 
Gaussian quadrature (Pinheiro & Bates, 1995), the total number of points again increases 
exponentially with the number of dimensions. Furthermore, adaptive Gaussian quadrature involves 
the computation of the posterior mode and variance of the latent distribution for each response 
pattern, with a level of complexity that increases with the number of dimensions as well. 

As an alternative, so-called limited information techniques can be used to estimate the 
parameters of multidimensional IRT models (Joreskog, 1994; Muthen, 1984). Limited 
information techniques have been developed in the field of structural equation modeling to deal 
with ordered categorical observed (indicator) variables. Because many IRT models can be 
formulated as confirmatory factor analysis models for ordinal data (Skrondal & Rabe-Hesketh, 
2004; Takane & de Leeuw, 1987), limited information estimation methods can be used to 
estimate the parameters of these IRT models as well. Unlike MML estimation methods, the 
limited information techniques do not take into account the complete joint contingency table of 
all items, but only marginal tables up to the fourth order (Mislevy, 1985). In this way, parameter 
estimation can be carried out using weighted least squares estimation and is reasonably fast, even 
for high-dimensional models. However, the number of elements in the optimal weight matrix, 
which has to be invertible, grows with the fourth power of the number of items (Mislevy, 1985), 
so that the sample size needed to estimate an IRT model with many items, which is the typical 
situation in educational measurement, becomes prohibitive in many practical applications. 
Alternatively, Muthen, du Toit, and Spisic (1997) proposed a robust weighted least squares 
approach where the optimal weight matrix is replaced by a diagonal matrix, having as elements 
the diagonal elements of the optimal weight matrix. 

IRT models that do not incorporate item discrimination parameters, such as the Rasch 
(1960) or the partial credit model (Masters, 1982), can be formulated as generalized linear mixed 
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models (Rijmen, Tuerlinckx, De Boeck, & Kuppens, 2003). For those models, quasi-likelihood 
methods (Breslow & Clayton, 1993) have been developed as relatively fast alternatives to MML 
estimation methods. Research showing that the early versions of these methods resulted in 
inconsistent estimates has led to attempts to improve upon these methods (Goldstein & Rasbash, 
1996; Rodriguez & Goldman, 1995). 

Notwithstanding the widespread use of limited information and quasi- likelihood 
estimation techniques and ongoing efforts for further improvements in these methods, one can 
safely assume that many researchers would prefer or at least consider using full-information 
MML techniques if they were available. In this paper, it is shown that full information MML 
estimation is feasible in many cases by exploiting the conditional independence relations that are 
implied by the model. Instead of using brute force integration over the joint distribution of all 
latent variables, the marginal likelihood and other quantities used during estimation can be 
computed by carrying out a sequence of integrations over subsets of latent variables. The 
particular collection of subsets depends on the conditional independence relations implied by the 
model. The level of computational complexity of a full information MML procedure that exploits 
the conditional independence relation implied by the model scales with the number of latent 
variables within a subset, which may be substantially lower than the total number of latent 
variables. 

Graphical model theory offers a general framework for deriving these subsets of variables 
and also provides the sequence in which to carry out the integrations. A main advantage of 
adopting the graphical model framework is that these subsets of variables can be derived in a 
fully automatic way by applying algorithms to the graphical representation of the model. This 
renders the approach fairly generally applicable, and complicated or tedious derivations done by 
hand become obsolete. 

In the next section, the bi-factor model is presented. The bi-factor model is an interesting 
model for many practical testing situations in that it incorporates both a general dimension that is 
common to all items and specific dimensions that pertain to subsets of items only. A second 
reason to start with the bi-factor model is that Gibbons and Hedeker (1992) showed how full 
information MML estimation in this case only requires the evaluation of integrals of dimension 
not higher than two. 
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Using the bi-factor model as a leading example, the next section of the paper will show 
how efficient estimation algorithms can be constructed by relying on graphical models. For the 
bi-factor model, the procedure involves the same set of two-dimensional integrations as specified 
by Gibbons and Hedeker (1992), but the result is obtained under much more general conditions. 
In the following section, the generality of the graphical model framework will be illustrated with 
a multidimensional model that incorporates a second-order dimension to account for the 
correlations between dimensions. 



The Bi-Factor Model 

In the bi-factor model, each item is an indicator of a general dimension and one of K 
other dimensions. Typically, the general dimension stands for the latent variable of central 
interest (e.g., reading ability), whereas the K other dimensions are incorporated to take into 
account additional dependencies between items belonging to the same subset (e.g., items of a 
reading test referring to the same reading passage). That is, conditional on general ability, items 
are assumed to be independent between but not within subsets. 

For binary data, the bi-factor model can be defined as follows. Let yj(k) denote the binary 
scored response on the j* item , 7 = 1,. . .., 7, embedded within item subset k,k= 1,..., K. There 

K 

are Jk items embedded within each subset k, hence ^ 7^, = 7 . The response vector pertaining to 

k=l 

item subset k is denoted by y^, and the vector of all responses is denoted by y. Conditional on K 
subset specific latent variables 6 *^ and a general latent variable 0^ that is common to all items, 

the responses are assumed to be statistically independent, 

P(y|e) = np(j-„., (1) 

7=1 



where 0 = (^ 6*^, ^1,..., 6*^,..., 6*^^ . Furthermore, n. =\\6^,0^^ is related to a linear 

function of the latent variables through a link function g (•) , 

= (2) 
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where ^ (•) is typically the probit or logit link function. The parameter is the intercept 
parameter for item j, anda^.^ and a.^. are the slopes or loadings of item j on the general and 

specific latent variables. Note that several distinct but formally equivalent parameterizations are 
being used in the IRT and factor analysis literature for the model presented in Equation 2. 

The slope parameters a ^ are assumed to be known in the so-called one parameter IRT 

models. When an item guessing parameter is also incorporated into the expressions for the ’s, 
a so-called three parameter IRT model is obtained. Furthermore, for polytomous responses, the 
model can be extended in a straightforward way by choosing a link function g (•) for 
polytomous data (Fahrmeir & Tutz, 2001). 

In order to identify the model, the location and scale of all dimensions have to be fixed. 
Typically, the mean and variance of each dimension are set to zero and one, respectively. In 
addition, K restrictions are needed stemming from the invariance of the model. This can be 
achieved by setting the correlations between each of the K specific dimensions and the general 
dimension to zero. Further details on identification are provided in Appendix A. 

The testlet model (Bradlow, Wainer, & Wang, 1999) is a special case of the bi-factor 
model. It is obtained by constraining the loadings on the specific dimension to be proportional to 
the loadings on the general dimension within each testlet (Fi, Bolt, & Fu, 2006; see also 
Appendix A). 

In a MMF estimation framework, the latent variables are assumed to be random variables 
with joint distribution r (0) . Then, the marginal probability of a response pattern is obtained by 
integrating out the latent variables, 

R(y) = jp(y|0)r(0)d0. (3) 

e 

When the latent variables are discrete rather than continuous, the integral in Equation 3 is 
replaced by a finite sum. 

MMF estimation, whether through a direct maximization of the likelihood (i.e., Newton- 
Raphson or quasi-Newton) or an indirect maximization (i.e., the EM algorithm), requires the 
evaluation of Equation 3 for each response pattern. Since there is no known closed-form 
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solution, one relies on numerical integration techniques, approximating P{y) as a finite sum. 

The computational complexity of brute force numerical integration becomes very high with an 
increasing number of dimensions. With a fixed number M of points for each latent variable for 
example, the number of terms in the finite sum amounts to M^, where D is the number of 
dimensions (D = K+\ for the bi-factor model). 

Fortunately, the computation of the marginal probabilities P(y) and other quantities that 

are used during estimation do not require numerical integration over a D-dimensional grid of 
points, if one is willing to make some assumptions with regard to the distribution of the latent 
variables. Exploiting these assumptions and the conditional independence relations implied by 
the bi-factor structure, E(y) can be obtained through a sequence of integrations in two 

dimensions. More specifically, the integrations involve the sets = 1,..., K. 

Gibbons and Hedeker (1992) derived this result for the bi-factor model under a specific 
set of conditions: 



g (•) is the probit link 
0~ A^(0,I) , where I is an identity matrix. 

These conditions arise from the fact that Gibbons and Hedeker (1992) base their derivation on a 
property of the multinormal integral (Stuart, 1958). For polytomous responses, see Gibbons et al. 
(2007) for a derivation along the same lines. 

In the next section, it is shown that P{y) can be obtained efficiently through a sequence 
of integrations over the sets (^6*^ , 6(. ^ under much more general conditions. More specifically, the 

derivation holds for any link function rather than being limited to the probit link; the distribution 
of 0 is not assumed to be multivariate normal; and finally, the assumption of independent latent 
variables is relaxed to the assumption that all specific latent variables are conditionally 
independent given the general latent ability. 

The relaxation of the last assumption deserves some further discussion. As is discussed in 
Appendix A, some constraints have to be imposed in order to identify the bi-factor model. When 
all latent variables are assumed to be independent (as in Gibbons & Hedeker, 1992), the model 
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can be identified by constraining the means and variances of the latent variables. When relaxing 
the assumption that all latent variables of the bi-factor model are independent into the 
assumption that the specific latent variables are conditionally independent, given the general 
latent ability, K additional identification restrictions are needed. One way to identify the model is 
to constrain the correlations between the specific and general ability to zero. In the multivariate 
normal case, a correlation of zero implies statistical independence. Hence, it may seem there is 
no point in relaxing the assumption that all latent variables of the bi-factor model are 
independent into the assumption that the specific latent variables are conditionally independent, 
given the general latent ability. However, the argument that is made in this paper is that, 
regardless of what constraints are needed for model identification, the derivation of Gibbons and 
Hedeker (1992) requires independent latent variables, whereas the derivation in the next section 
only requires conditional independence. Furthermore, a correlation of zero implies statistical 
independence for the normal distribution, but not for an arbitrary distribution r(0) . Finally, the 

researcher may have reasons to identify the bi-factor model by restrictions other than imposing 
uncorrelated dimensions. 

Efficient Computations Based on Graphical Modeling 
Statement of the Problem 

The efficient MML estimation method that will be presented for the bi-factor model in 
this section and for a second-order factor model in the subsequent section are essentially specific 
instantiations of a general expectation maximization (EM) algorithm that has been developed in 
the field of graphical modeling (Lauritzen, 1995). The algorithm is efficient in that the E-step is 
carried out using local computations on subsets of variables that are conditionally independent. 
These sets are derived by working from the graphical representation of a statistical model. A 
thorough account of the general procedure involves a substantial amount of graph theory and is 
outside the scope of this paper. The main results will be stated without proof. The interested 
reader is referred to Cowell, Dawid, Lauritzen, and Spiegelhalter (1999) for a more in-depth 
account. Instead, a more intuitively based account is presented using the bi-factor model as a 
leading example. 

After computing the relevant quantities in the E-step in an efficient way, parameters can 
be updated in the M-step in the same way as in a traditional EM algorithm that relies on brute 
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force integration in the E-step (see Bock, Gibbons, & Muraki, 1988, for an EM algorithm for the 
multidimensional IRT model; or Eahrmeir & Tutz, 2001, for a description of an EM algorithm 
for generalized linear mixed models). 

As a starting point, consider how an EM algorithm proceeds for the bi-factor model. The 
E-step consists of the computation of the expected complete data log likelihood, where the 

expectation is taken over the posterior distribution < 7^6 |y ; j of the latent variables given the 
data and the vector of provisional parameter estimates : 

2(a;d'’“) = £'{logL^(y,0;a)|y;a"“}, 

= XjlogP(y|0)^(0|y;r“)d0+Xjlogr(0)^(0|y;d'"‘')d0 (4) 

i 0 i Q 

The first term, using the bi-factor structure can be rewritten as 
XjlogR(y| 0 )^( 0 |y;r“)d 0 

i 0 

i e k 

= Z Z I >“8 '’ ( « (® k 

i k e 

= ZZ i I >“8 p(y, \0,0,)q[0,A |y;«'" y0>d0, 

where 0^^^ is the vector of subset specific latent variables less 6^ . Hence, the first term of 
Q{^a;a°‘^ ^ , which is the quantity to be maximized with respect to the parameters during the 

subsequent M-step, involves a sequence of bidimensional integrations over the sets j . 

Such a reduction does not hold in general for the second term. It is easily verified, however, that 
integrations over the same sets (^ 6 *^ , 6 *^ ^ are obtained when incorporating the assumption that the 
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specific latent variables are conditionally independent of each other given the general latent 
variable, 



k 



( 6 ) 



Through algebraic manipulations, the very important result is obtained that Q^a;a°‘‘‘ ^ 

only contains two dimensional integrals over the sets , 6*^ ^ . Note that these are the same sets 

as the ones obtained by Gibbons and Hedeker (1992). One is still short of proving their result 
under more general conditions, however. The reason is the computation of the posterior density 

q{^6^,6^ |y; ) • The default but cumbersome way to compute q{^6^,6i, |y; d"'"' j would be 
through an application of Bayes’ theorem. 



jjr(e,.e^)p{y\e,.e,)de,de, 



where = can be obtained by marginalizing over the K-\ other 

latent variables, 

1 ( 8 ) 

®(‘) 

Consequently, a brute force computation of the denominator in Equation 7 still requires the 
(numerical) evaluation ofdiK+ 1 -dimensional integral. 

To conclude, in order to be able to exploit the conditional independence relations implied 

by the bi-factor model, a procedure is needed to compute the posterior densities q[6^,6^ |y; d"“ j 

in a way that is more efficient than first carrying out numerical evaluations on a .^T-i-l dimensional 
grid, and subsequently collapsing the resulting table over K-\ dimensions. This is where 
graphical modeling comes into the picture. 
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Directed Acyclic Graphs 

Even though many useful results of graphical modeling apply to models that incorporate 
both discrete and continuous (normally distributed) variables (Lauritzen, 1996), a condition for 
models represented by a directed acyclic graph (DAG) is that continuous parents should not have 
discrete children. This condition is obviously not met in the bi-factor model. Therefore, we will 
approximate each latent variable 6* by a discrete latent variable z, with, as equivalents of 
Equations 1 and 6, 

/’(y|z) = np(y,(,)|z^,z,) (9) 

y=l 

'’(")='’E)nf(^.k)' (10) 

k 

As a matter of fact, replacing the vector of continuous latent variables 0 with a vector of 
discrete latent variables z is tantamount to what is done when evaluating the integral over 0 
using numerical integration in traditional MME estimation procedures. That is, from a 
computational viewpoint, there is no difference at all between having 0 in the model 
formulation and approximating the integrals over 0 through numerical integration over a 
discrete grid z on the one hand, and approximating the model through the estimation of its 
discrete counterpart incorporating z on the other hand. By estimating E(z) (and z) from the 

data, any distribution r(0) can be approximated (Aitkin, 1999; Eaird, 1978). When 0 is assumed 
to be multivariate normally distributed, both the values z and their probabilities P{^) are pre- 
specified using the formulas for Gaussian quadrature (Abramowitz & Stegun, 1974), and then 
centered and scaled separately for each score pattern based on the maximum and variance of the 

posterior distribution |y;d"“ j in the case of adaptive Gaussian quadrature (Pinheiro & 

Bates, 1995). 

The starting point is to represent the bi-factor model (with discrete latent variables z) as a 
DAG. DAGs have been used for a long time to represent statistical models, especially in the 
structural equation and factor analysis literature. Eigure 1 (top panel) represents the DAG of the 
bi-factor model characterized by Equations 9 and 10 for a model with four subsets of items and 
three items within each subset. In the graph, each node corresponds to a random variable, and the 
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z 




z 

0 




Figure 1. Directed acyclic graph of the bi-factor model with four subsets of items and three 
items within each subset. Top: Individual items represented as nodes. Bottom: Item subsets 
represented by a single node. 
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(absence of) directed edges between nodes represent conditional (in)dependence relationships. 

For example, the directed edge between and Zj ( Zg is called a parent of Zj ) represents the 

conditional dependence of Zi on z^ , whereas the conditional independence between Zi and Z 4 is 

represented by the fact that both are children of z^ , and that they are in no other way connected. 

The items have a special status in that they appear as terminal nodes in the DAG. As explained in 
Rijmen, Vansteelandt, and De Boeck (2008), the DAG (and its transformation into other types of 
graphs) can be simplified by collecting the set of observed variables that share the same parents 
and appear as terminal nodes into a single node (see Figure 1, bottom panel). 

A DAG associated with a probabilistic model for a set of discrete random 
variables Vj , . . . , admits a recursive factorization of the joint probability function, 

M 

P{x) = Y[p[x^\pa[xJ), ( 11 ) 

m=l 

where pa(xm) denotes the set of variables that are parents of Xm- Applying Equation 1 1 to the DAG 
in Figure 1, one indeed obtains the factorization of the probability of the complete data vector 

R(y,z) = P(y |z^P(z) according to the bi-factor model characterized by Equations 9 and 10. 

Note that adding edges in the DAG of Eigure 1 still results in a valid factorization of the 
joint probability function. However, some conditional independence relations will no longer be 
represented in the DAG. In general, adding unnecessary edges results in loss of efficiency of 
computational schemes such as the one described below. 

Transforming the Directed Acyclic Graph (DAG) to a Junction Tree 

The core of the construction of efficient computational schemes relies on the 
transformation of a DAG into a junction tree. Eor a detailed account of the algorithms for 
transforming a DAG into a junction tree, see Cowell et al. (1999). I suffice by presenting the 
transformations with minimal details. 

A first step is transforming the DAG into an undirected graph. The undirected graph is 
called the moral graph. It is obtained by adding an undirected edge between all nodes with a 
common child that are not yet joined, and dropping directions from all edges. Eigure 2 displays 
the moral graph of the directed acyclic graph of Eigure 1 . No edges have to be added to the DAG 
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of the bi-factor model, because there was already an edge between Zg and Zi, Z2, Z3, and Z4, 
respectively, with which Zg has respectively Fi, Y 2 , F 3 , and F 4 ,each as a common child. 



z 




Vi Y2 y0 



Figure 2. Moral graph for the bi-factor model with four subsets of items. 

Note that, were independence between all latent variables assumed, as in Gibbons and 
Hedeker (1992), rather than the more lenient assumption that the specific dimensions are 
conditionally independent given the general dimension (Equation 10), the same moral graph 
would have been obtained. Specifically, in that case, there would be no edges in the DAG from 
Zg to Zi, Z2, Z3, and Z4, respectively. However, these edges would have been added when 
forming the moral graph. The fact that both DAGs result in the same moral graph is precisely 
why the assumption of independent latent variables can be relaxed into the assumption that the 
specific latent variables are conditionally independent given the general latent variable without 
sacrificing computational efficiency. 

The moralization step ensures that a probabilistic model that satisfies the conditional 
independence relations implied by a DAG also satisfies the conditional independence relations 
implied by the undirected moral graph of the DAG. In the process of moralization, conditional 
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independence relations that were implied by the DAG might lose their representation in the 
moral graph by the process of adding edges. 

Second, the moral graph is triangulated by adding edges so that chordless cycles contain 
no more than three nodes. A chordless cycle is a cycle in which there are only edges between 
consecutive nodes. In general, a triangulated graph can be obtained in many different ways, but 
one tries to add as few edges as possible to retain the graphical representation of the conditional 
independence relations that were implied by the DAG. Finding an optimal triangulation is 
nondeterministic polynomial-time (NP) hard (Yannakakis, 1981; for the reader not familiar with 
computational complexity theory, NP-hard is very hard), but well performing heuristic 
algorithms are available (Kjaerulff, 1992). The moral graph in Figure 2 contains no cycles and 
thus is already triangulated. 

A graph being triangulated is a necessary and sufficient condition for the existence of an 
associated junction tree. A tree is a graph whose undirected version (obtained by dropping all the 
directions from the edges) has a path between all pairs of nodes and has no cycles. In a junction 
tree, the nodes correspond to cliques. Cliques are complete subsets of nodes. A set of nodes is 
complete if there is an edge between every pair of nodes. The intersection between two 
neighboring cliques Ck and Ci is called a separator. Ski = Ci^r\Cj. 

A junction tree possesses the running intersection property: the intersection n C, of a 

pair Ck, Cl of cliques is contained in every node on the unique path in the junction tree between 
Ck and C/. Figure 3 shows a junction tree of cliques obtained from the triangulated moral graph 
of Figure 2. Again, more than one junction tree can be constructed in general. 



Vi 

'1 1 g 



^2 ^2 



^2, h 



Va Za z 
'4 4 o 



Figure 3. Junction tree for the bi-factor model with four subsets of items. 
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Junction Tree Algorithm 

A crucial result is that a junction tree offers an alternative factorization of the joint 
probability function. In particular, the joint probability function of all variables can be factorized 
as the product of all marginal clique probabilities over the product of all marginal separator 
probabilities: 



P(x) 



ric-p(^c) 



( 12 ) 



where Xc and x^ denote the random variables that constitute clique c and separator s, respectively. 
For the bi-factor model, it is verified in Appendix B that the probability of the complete data 
vector P(y,z) can indeed be rewritten as 



^(y,z) 



k 




(13) 



The cliques correspond to the sets {yu, Zk, Zg), and are exactly the sets of variables that 
appeared in the individual terms of Q (^a; ^ , the expected complete data log-likelihood for the 

bi-factor model (see Equation 5). Hence, computing using the factorization of 

Equation 13 results in the same dimensionality reduction as was obtained in Equation 5 through 
algebraic manipulations. More importantly, graphical modelling theory offers an efficient way 

for computing the posterior probabilities E ^ |y ; j , as is explained in the following. 

The factorization of Equation 12 serves as the basis for an efficient computational 
scheme using local computations. A first step is to associate a non-negative potential function \|/ 
to each clique and separator of the junction tree. The domain of \|/ is the set of all possible 
realizations of the random variables in the clique or separator. 

The potential values or potentials may be assigned as follows. Eirst, all potentials are 
initialized with value 1. Then, each factor of Equation 1 1 is multiplied into the potential function 
of a clique that contains all the nodes corresponding to the random variables in the factor. The 
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way a junction tree is constructed implies that there is always such a clique; if there is more than 
one, it does not matter which one is chosen. Then, by definition. 



P{x) 



ric^(^c) 



(14) 



Next, a schedule of flows is passed along the edges of the junction tree. Let Ct and Ci be two 
consecutive nodes of the junction tree, with separator Sti . New potentials are defined as 



V (XsJ= Z v(XcJ, 

c,\s„ 

and 



where y, X(^x^^ j = 0 if V|/(Xs^ ) ^ 0, 

^ V / 



(15) 



and ^ denotes marginalization over all random variables whose corresponding nodes are in 

c*\s„ 

Ck\St 

Jensen, Lauritzen, and Olesen (1990) proposed the following two-phase schedule: First, 
select an arbitrary clique of the junction tree as the root-clique. In the collection phase, flows are 
passed along the edges towards the root-clique. In the distribution-phase, flows are passed in the 
reverse direction. See Figure 4 for a scheduling of flows. 

After applying the two-phase schedule, equilibrium is reached, and the clique and 
separator potentials correspond to the marginal probability functions of the cliques and 
separators, respectively. Numerical underflow can be avoided by normalizing the potentials of 
each clique after updating the potentials of that clique, for example by dividing each potential by 
the sum of the clique potentials. After applying the two-phase schedule, the clique and separator 
potentials then become proportional to the marginal probability functions of the cliques and 
separators, respectively. The proportionality constant equals the product of the clique 
normalizing constants. 
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Figure 4. Scheduling of flows along the junction tree for the bi-factor model with four 
subsets of items. 

When some of the variables are observed, posterior probability distributions of the 
unobserved variables can be obtained along similar lines. The only change is that, for each 
observed variable, some arbitrary clique that contains the variable is selected, and the potentials 
of all realizations of the clique variables involving a different state for the observed variable than 
the observed one are set to 0. This step is called “entering the evidence.” After applying the two- 
phase schedule, the clique and separator potentials now are proportional to the posterior marginal 
probability functions of the cliques and separators, respectively. Normalizing the potentials of a 
clique or separator so that they sum to one yields its posterior probability distribution. Posterior 
distributions for individual variables are obtained by marginalizing over all other variables in the 
clique or separator. 

For the bi-factor model, the observed variables are the items, and they all appear as 
terminal nodes. As mentioned before, the subsets of items sharing the same set of parents can be 
merged into a single node in this case. The corresponding DAG is much simpler, and 
consequently the construction of a junction tree and the propagation of evidence along the 
junction tree are also simplified considerably. The effective state space of such a merged node is 
only of size one and equals, for each case, the observed response pattern on the corresponding set 
of observed variables. This is because, when entering evidence, the potentials of all 
configurations other than the observed one are set to zero. Including these configurations would 
only result in a needless propagation of zeroes (Huang & Darwiche, 1996). Initialization is done 
in the same way as described before, except for the fact that now, for each case, the conditional 
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probabilities of the observed response patterns on the sets of terminal nodes sharing the same 
parent(s) are used. These conditional probabilities are computed over all data that are not missing 
(assuming ignorable missingness; Little & Rubin, 1987). If all data on such a set of observed 
variables are missing, this probability is set to 1, leaving the potentials unaltered. 

In summary, by applying transformations to the graph, all conditional independence 
relations of the probabilistic model could be rendered explicit that did not lose their 
representation in graphical form during the process of moral i z ati on and triangulation. The 
factorization of the joint probability function along these conditional independence relations will 
allow for computations to be carried out locally to yield marginal or conditional posterior 
distributions of interest, given observed data. 

This is a very important result, because it shows how the E-step of the EM algorithm can 
be carried out efficiently (Eauritzen, 1995). Eirst, the junction tree is initialized using provisional 
parameter estimates . Then, separately for each case, the observed response pattern is 
entered as evidence into the network. After propagating the evidence along the junction tree, the 
posterior clique probabilities are obtained. These are the posterior probabilities needed to 
compute the expected complete data log-likelihood. In the M-step, updated 

parameters are obtained by maximizing with respect to the parameters. 

The complexity of the E-step of this efficient EM algorithm scales with the sum of the 
clique state spaces. Hence, the smaller the clique state spaces, the greater the gain in efficiency 
obtained by using the modified instead of the standard EM algorithm. 

Eor the bi-factor model, the cliques of the junction tree correspond to the sets (y^:, Zk, Zg), 
which are exactly the sets of variables that appeared in the individual terms of ^ as 

defined in Equation 5. Hence, after propagating the evidence along the junction tree, the 
(renormalized) posterior marginal clique probabilities correspond to the posterior probabilities 

P{^Zg,Zi, j . It follows that an efficient full information MME estimation method, involving 

a sequence of two-dimensional integrations over the sets can be derived under much 

more general conditions than the ones stated by Gibbons and Hedeker (1992). Appendix B 
contains a more detailed description of how the E-step of the EM algorithm is carried out 
efficiently for the bi-factor model. A main advantage of graphical modeling is that it offers a 
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very general procedure that can be carried out in an algorithmic way, rendering tedious algebraic 
manipulations of the likelihood function obsolete. 



A Multidimensional Model With a Second-Order Dimension 

In this section, the generality of the approach is illustrated with a multidimensional IRT 
model incorporating a second-order dimension. In the model, all dependencies between the first- 
order dimensions are explained by the second-order dimension. The DAG for a model with four 
first-order dimensions is given in Figure 5. The absence of edges between the first-order 
dimensions represents their conditional independence given the second-order factor. The moral 
graph is simply obtained by dropping the directions of the edges, and a junction tree is given in 
Figure 6. From Figure 6, it follows that an efficient E-step of the EM algorithm involves a 
sequence of two-dimensional integrals over the sets of latent variables (zk, Zg), k= 1,. . ., 4, as was 
the case for the bi-factor model. As a matter of fact, comparing Eigure 5 to Eigure 1 reveals that 
the second-order model is a bi-factor model with the additional restriction that the conditional 
item response probabilities do not directly depend on the general dimension. 




Figure 5. Directed acyclic graph for a four-dimensional model with a second-order 
dimension. 
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Figure 6. Junction tree for a four-dimensional model with a second-order dimension. 

Discussion 

Graphical models have been used widely to represent statistical models in a visually 
attractive way. In this paper, it is argued that graphical modeling has more to offer. It was 
explained in some detail how efficient MML estimation procedures can be developed by 
applying transformations to the directed graph representing the statistical model. Working in a 
graphical modeling framework has the advantage that, once the statistical model has been 
represented in a DAG, the transformations can be applied in a fully algorithmic way, and the sets 
of conditionally independent variables are obtained automatically. 

In this paper, attention was limited to multidimensional IRT and item factor analysis 
methods, because these methods represent an important class of statistical models in the 
quantitative social and behavioral sciences. Moreover, especially in the IRT community with its 
focus on full information MML estimation, the applicability of multidimensional models has 
long been considered rather limited because of the intractable multidimensional integral that 
appears in the marginal probability of a response pattern. The important result that was 
established in this paper is that this dismissal of multidimensional models because of 
computational considerations has not been entirely fair with respect to such models. As long as 
one is willing to impose some structure on the model, the estimation can be simplified by 
exploiting the imposed conditional independence relations. 

When one is not willing to impose any conditional independence relations, full 
information MML estimation still becomes infeasible with an increasing number of dimensions. 
However, such models are mostly considered in an exploratory stage only, after which typically 




a simple structure model is constructed for (a subset of) the items. In the exploratory stage, 
obtaining full information MML results is not of crucial interest. The following sequential 
procedure could be followed: first, several models could be explored using limited-information 
estimation techniques. Once a good candidate model is established that incorporates a simplified 
structure, the parameters of this model can be estimated using the efficient EM algorithm. 
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Appendix A 

Identification Restrictions for the Bi-factor and the Testlet Model 



The bi-factor model was defined in Equation 2 as 



(Al) 



Identification Restrictions 

There are three types of identification restrictions: 

1 . ^ -I- 1 restrictions to fix the origin of the general and K item subset specific 
dimensions. In a MML framework, this can achieved through the restrictions 
E(0) = O. 

2. K + \ restrictions to fix the scale of the general and K item subset specific 
dimensions. In a MML framework, this can achieved through the restrictions 
<Jg =h<Jt =^,k = \,...,K . 

3. K restrictions to deal with the rotational invariance. Algebraically, rotational 

invariance can be shown as follows. Assuming that the location and scale are already 
fixed through E (0) = 0 and cr^ = 1, = 1, k = 1, ..., , we rewrite Equation Al , 

= + ^A + A + - ^>^^A 

= («.. - ) + Pj (A2) 

= ale^+a,a.,[e,+c,e^)la,+P^ 

= aA^a],eA/3, 

with a]^ =(a.^ -c, «.,),«*, = a, a.,, 6l=[e, +c,e^)la„ and af=l-tc>2c,p^,. a] is the 
variance of and is determined by the subset specific constant c^, and the correlation 

between 6^ and 6^ . Dividing (^6*^ ^ by its standard deviation ensures that the 

variance of 0l is 1. We obtained again the expression for a bi-factor model, but now expressed 
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in a different basis. The rotational invariance can be fixed by setting the correlations to zero 
for all k. 



The Testlet Model 

Bradlow et al. (1999) formulated the testlet model in a Bayesian framework. Its analogue 
in a maximum likelihood framework for a model without guessing parameter can be formulated 
as 



= + (A3) 

To identify the model, E ( 0 ) = 0 , cr^ = 1 , and = 0 for all k as in the bi-factor model. In 

contrast, the variances cr^ are free parameters. The restrictions = 0 fix the rotational 

invariance of the testlet model. Similarly to the bi-factor model, the rotational invariance of the 
testlet model can be shown algebraically as follows: 

where «“ =«.^(l-c,), and +c,^J/(l-c,). 

Expressing Equation A3 as a model in which all latent variables have unit variance, Ei et 
al. (2006) showed that the testlet model is a bi-factor model in which the loadings on the specific 
dimension are proportional to the loadings on the general dimension within each testlet, 

= (A5) 



with = *^jg^k so that cr^. = 1 . 
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Appendix B 

Computations for the Bi-Factor Model 



First, it is shown how the complete data likelihood can be rewritten according to 
Equation 13. From Equations 9 and 10, 

E(y,z) = E(z)E(y|z) = E(zJn^(^^bJ^(y^b.’^^) 

k 

where F’fy^ j is the vector of responses to the items of testlet k. Then, 



P{y,^) = 






^K-\ 



K-\ 



[p^_ 

nf(z„z,)/'(y,|z,.z») 

[^(-1 

[^(^<) 



n/r-1 



resulting in Equation 13. 

It is now shown in detail how the junction tree can be used to obtain the marginal 
posterior probabilities |y^ for all k. A first step is to initialize the clique and separator 

potentials. 

For the bi-factor model, there are K cliques corresponding to the sets {yu, Zk, Zg), and K-l 
separators that all consist of the single variable Zg. All separator potentials are initialized with a 
value of 1 for all possible realizations of Zg. The clique potentials are initialized as follows. For 
each of the factors in Equation Bl, a clique is selected that contains all its variables. Specifically, 

the factors p(y^ |z^ , j and p(z^ |z^ j are associated with clique (y^;, Zk, Zg), for k =1,. ,.,K. 
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^ can be associated with any clique since Zg occurs in all cliques. Here, it is associated to 

clique (y^:, zk, Zg). Next, the factors of Equation B1 are multiplied into the cliques they are 
associated with. Hence, for the cliques (yk, Zk, Zg), k=l,..., K potential for every possible 
realization of Zk and Zg is defined as 

Both factors are computed straightforwardly given provisional estimates for the parameters that 
characterize |z^ , z^. j and j , and the observed response pattern y^. Obviously, the 

thus defined clique potentials will be specific for each observed response pattern y^. 

For clique (jk, Zk, Zg), the potential for every possible realization of zk and zg is defined 
as 






(B3) 



It is easily verified that the thus defined potentials obey they property stated in Equation 14 that 
the joint probability function can be expressed as the product of all clique marginals over the 
product of all separator marginals: 



P(y,z) = P(x) = 



ric^(*c) 



=ric^(*c)/i 

K 



k=\ 



k=\ 



(B4) 



Finally it is shown how the marginal posterior clique probabilities are obtained by 
applying the two-phase schedule of flows. During this process, the clique and state potentials are 
altered but the property that the joint probability function can be expressed as the product of all 
clique marginals over the product of all separator marginals is preserved. Without loss of 
generality, the schedule is illustrated for K=A. 
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Using the junction tree of Figure 4, the first flow is passed from clique (y^, Z4, Zg) to 
clique (y^, zs, Zg) as described in Equation 15. First, the potential of the separator in between the 
two cliques is updated. Denoting the two cliques by C 4 and C 3 respectively, and the separator by 
534 , 



^4-y4 



(B5) 






¥ 









and 






(xc3) = v(xc3)^(x,3j = E(z3|zjE(y3|z,,Z3)p(y4,zJ. 



Note that y 4 , as explained before, is treated for any given person as a merged node with only one 
possible realization, which is the observed response pattern of the person. It is easily verified that 
Equation 13 still holds, 

Ylc^M 



El 


(y4,zj 


|E| 


/ 

l^4 




”(y4 ^ 


, Z 4 ^ 


I'^E) 






1 


lx1xP( 


y4-^g 


) 



= E(zi|z,)E(yi|z,,Zi)E(z2|z,)E(y2|z,,Z2)E(z3|z,)E(y3|z,,Z3)E(z4|z,)^(y4|zg>24)^(^,) 

= ^(y,z)- 



The second flow is passed from clique C 3 to clique C 2 , over separator S23, where the 
potentials for C 3 have been updated according to Equation B5, 
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^*^3)= Z 

^ 3^‘^23 

43 .ys 

= /’(y4,zJ^(y3|^,) 

= ^(y3,y4>^«) 



using the conditional independence of and y4, given for the last equality; 



A(x, ),zj^, dy 3 .y 3 , 4 .) 

and 

V(xcJ = v(xcJ^(x,J = p(z2|zJp(y2|z,,Z2)p(y3,y4,zJ- 

The next flow concludes the collection phase, and results in the following potentials 

V*KJ= Z N/(xcJ 







Z^(^2 

^ 2 -y 2 


)^(y2 2g>22)^(y3’y4>^J 


^(yj.yj.z. 


)^(y2k) 


^'(y3k)^l 


^y3,y4 



= ^(y2>y3’y43^J 



using the conditional independence of y2, y3 and y4, given Zg for the last equality; 



x(^ _ p{y 2 ^y 3 ’y 4 ’Zg) 

v(vj 1 



'!''(v,) = v(>:c,)4(xsj = p(z,|z,)p(y,|z,,z,)p(y„y„y„zj. 



It is left to the reader to verify that Equation 13 remains valid after the second and third 
propagation of flows. 
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From the conditional independence given Zg between y2, ya, and y4 on the one hand and yi 
and zi on the other hand, 



= /’(y,z,,Zi) 



(B 6 ) 



Hence, the clique potentials for Ci are proportional to the marginal posterior probabilities of its 
latent variables zi and Zg. The exact posterior probabilities for the latent variables z\ and Zg are 
obtained through a renormalization of the clique potentials. 



/’(z,,z,|y) 



Zg,Zi 



(B 7 ) 



To obtain the marginal posterior clique probabilities for the remaining cliques, the second phase 
of flows has to be propagated. In this distribution phase, cliques are updated in reversed order. 
So, first a flow is passed from clique Cj to clique C2, over separator S12, 



^1-yi 

^ ^(y2-y3>y4-zj 



(y,\z,y 



and 






K ) = ¥(xcJ^(x,J = p(z2 |zjp(y2 |z^,Z2)p(y3,y4,z,)F(yi = F(y,Z2,zJ. 



Now, the clique potentials for C2 are proportional to the marginal posterior probabilities of its 
latent variables Z2 and Zg. 

For the last two flows, the clique and separator potentials are updated as, respectively. 
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C'2'^^23 

^2-y2 






= i’(y,zj, 

V(^s,,)_ P(y,z^) 



¥ 



K) ^’(y 3 .y 4 .z,) 



= P 



(yi,y 2 K)> 



and 



¥ 



K) = ¥(xc3)^(x,^3) = ^(z3h«)^(y3k>z3)^(y4>z,)^(yny2K) = ^(y.z3>z«)> 



and 



c,\s,, 

= llp(y^z„z^) 

Z3^Y3 

= p(y,z^), 



* 

¥ 




L 


¥( 




P(y 4 ,zj 



and 

X|/*(xcj = ¥(xcj^(x,j = p(zjp(z4|z,)p(y4|z,,z4)p(yi,y2,y3|z,) = ^(y,z4^zj- 



Posterior clique probabilities are obtained by normalizing the clique potentials analogously to 
Equation B7. 

It is again easily verified that the clique and separator potentials have reached 
equilibrium: no passage of a flow between any two cliques will alter the potentials of the 

separator or receiving clique. All the separator potentials are equal to P(y, Zg ) , which is also 



obtained after marginalizing any of the K clique potentials over Zk- Hence, the update factors 







are all equal to one, and the potentials are unaltered after passing a flow. 
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